Indexing and stemming approaches for the Czech language
Authors
Abstract
This paper describes and evaluates various stemming and indexing strategies for the Czech language. Using a Czech test collection, we designed and evaluated two stemming approaches, a light one and a more aggressive one. We compared them with a no-stemming scheme as well as a language-independent approach (n-gram). To evaluate the suggested solutions, we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM), and the classical tf-idf vector-space approach. We found that the Divergence from Randomness paradigm tends to provide better retrieval effectiveness than the Okapi, LM, or tf-idf models; however, the performance differences were statistically significant only with the last two IR approaches (LM and tf-idf). Omitting stemming generally reduces the mean average precision (MAP) by more than 40%, and these differences are always significant. Finally, although our more aggressive stemmer tends to show the best performance, the differences in performance from the light stemmer are not statistically significant.
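As a rough illustration of two notions the abstract relies on, the sketch below shows one way to produce language-independent character n-gram index terms and one way to compute mean average precision (MAP) over a set of queries. It is not the authors' implementation: the function names, the n-gram length of 4, and the rule of keeping short words whole are assumptions made only for this example.

```python
def char_ngrams(word, n=4):
    """Return the overlapping character n-grams of a word.

    Assumption for this sketch: words no longer than n are kept whole.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against the relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the per-query average precision.

    runs is a list of (ranked_ids, relevant_ids) pairs, one pair per query.
    """
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

With such a scheme, each word in a document or query is replaced by its n-grams before indexing, so no language-specific stemming rules are needed. Comparing two indexing strategies then amounts to comparing their MAP values over the same queries.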
Similar resources
Stemming Approaches for East European Languages
During this CLEF evaluation campaign, the first objective is to propose and evaluate various indexing and search strategies for the Czech language that will hopefully result in more effective retrieval than language-independent approaches (n-gram). Based on the stemming strategy we developed for other languages, we propose that for the Slavic language a light stemmer (inflectional only) and als...
University of Chicago at the CLEF 2007 Cross Language Speech Retrieval Track
The University of Chicago participated in the CLEF 2007 CL-SR track, performing monolingual retrieval for both English and Czech and cross-language French-English retrieval. English experiments considered the impact of automatically generated keywords on retrieval. Czech experiments explored the effect of different stemming approaches on retrieval for this morphologically rich language. The bes...
Named Entity Recognition for Highly Inflectional Languages: Effects of Various Lemmatization and Stemming Approaches
In this paper, we study the effects of various lemmatization and stemming approaches on the named entity recognition (NER) task for Czech, a highly inflectional language. Lemmatizers are seen as a necessary component of Czech NER systems, and they have been used in all papers on Czech NER published so far. Thus, it is of the utmost importance to explore their benefits, limits and differences between...
Indexing and searching strategies for the Russian language
This paper describes and evaluates various stemming and indexing strategies for the Russian language. We design and evaluate two stemming approaches, a light and a more aggressive one, and compare these stemmers to the Snowball stemmer, to no stemming, and also to a language-independent approach (n-gram). To evaluate the suggested stemming strategies we apply various probabilistic information r...
Evaluation of Stemming, Query Expansion and Manual Indexing Approaches for the Genomic Task
This paper describes our participation in TREC-2005 for the ad hoc Genomic track, in which we evaluate five different stemming approaches to performing domain-specific searches within a MEDLINE subset. We also evaluate the impact that manually assigned descriptors (MeSH headings) have on retrieval effectiveness. We design a domain-specific query expansion scheme and compare it with the more clas...
Journal title:
- Inf. Process. Manage.
Volume 45, Issue
Pages -
Publication date: 2009